Detect-and-Track: Efficient Pose Estimation in Videos
This paper addresses the problem of estimating and tracking human body
keypoints in complex, multi-person video. We propose an extremely lightweight
yet highly effective approach that builds upon the latest advancements in human
detection and video understanding. Our method operates in two stages: keypoint
estimation in frames or short clips, followed by lightweight tracking to
generate keypoint predictions linked over the entire video. For frame-level
pose estimation we experiment with Mask R-CNN, as well as our own proposed 3D
extension of this model, which leverages temporal information over small clips
to generate more robust frame predictions. We conduct extensive ablative
experiments on the newly released multi-person video pose estimation benchmark,
PoseTrack, to validate various design choices of our model. Our approach
achieves 55.2% on the validation set and 51.8% on the test set under the
Multi-Object Tracking Accuracy (MOTA) metric, and achieves state-of-the-art
performance on the ICCV 2017 PoseTrack keypoint tracking challenge.
Comment: In CVPR 2018. Ranked first in the ICCV 2017 PoseTrack challenge (keypoint
tracking in videos). Code: https://github.com/facebookresearch/DetectAndTrack
and webpage: https://rohitgirdhar.github.io/DetectAndTrack
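To make the two-stage idea concrete, below is a minimal sketch of a lightweight linking stage: per-frame person boxes (assumed to already come from the keypoint-estimation stage) are associated across consecutive frames by Hungarian matching on box IoU. The function names, IoU cost, and threshold are illustrative assumptions, not the released DetectAndTrack implementation.

```python
# Illustrative sketch of lightweight tracking by bipartite matching on box IoU.
import numpy as np
from scipy.optimize import linear_sum_assignment

def box_iou(a, b):
    """IoU between two boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def link_tracks(frames, iou_thresh=0.3):
    """frames: list of per-frame lists of person boxes.
    Returns, for each frame, a list of track ids aligned with its boxes."""
    next_id, prev_boxes, prev_ids, all_ids = 0, [], [], []
    for boxes in frames:
        ids = [-1] * len(boxes)
        if prev_boxes and boxes:
            # Hungarian matching on (1 - IoU) cost between consecutive frames.
            cost = np.array([[1.0 - box_iou(p, b) for b in boxes] for p in prev_boxes])
            rows, cols = linear_sum_assignment(cost)
            for r, c in zip(rows, cols):
                if 1.0 - cost[r, c] >= iou_thresh:
                    ids[c] = prev_ids[r]  # propagate the existing track id
        for i in range(len(boxes)):
            if ids[i] == -1:  # unmatched detection starts a new track
                ids[i] = next_id
                next_id += 1
        all_ids.append(ids)
        prev_boxes, prev_ids = boxes, ids
    return all_ids
```

Because the matching only looks at adjacent frames and simple geometric overlap, it stays cheap regardless of how heavy the per-frame pose model is, which is the appeal of keeping the tracking stage lightweight.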
HierVL: Learning Hierarchical Video-Language Embeddings
Video-language embeddings are a promising avenue for injecting semantics into
visual representations, but existing methods capture only short-term
associations between seconds-long video clips and their accompanying text. We
propose HierVL, a novel hierarchical video-language embedding that
simultaneously accounts for both long-term and short-term associations. As
training data, we take videos accompanied by timestamped text descriptions of
human actions, together with a high-level text summary of the activity
throughout the long video (as are available in Ego4D). We introduce a
hierarchical contrastive training objective that encourages text-visual
alignment at both the clip level and video level. While the clip-level
constraints use the step-by-step descriptions to capture what is happening in
that instant, the video-level constraints use the summary text to capture why
it is happening, i.e., the broader context for the activity and the intent of
the actor. Our hierarchical scheme yields a clip representation that
outperforms its single-level counterpart as well as a long-term video
representation that achieves state-of-the-art results on tasks requiring long-term video
modeling. HierVL successfully transfers to multiple challenging downstream
tasks (in EPIC-KITCHENS-100, Charades-Ego, HowTo100M) in both zero-shot and
fine-tuned settings.
Comment: CVPR 2023
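The sketch below shows what a two-level contrastive objective of this kind can look like in PyTorch. The symmetric InfoNCE formulation, the mean pooling of clip features into a video feature, and all tensor shapes are assumptions for illustration, not the paper's exact loss or aggregation scheme.

```python
# Illustrative two-level (clip + video) contrastive objective.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings a[i] <-> b[i]."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def hierarchical_loss(clip_feats, narration_feats, summary_feats,
                      clips_per_video, w_video=1.0):
    """clip_feats, narration_feats: (B*K, D), paired per clip and grouped by video;
    summary_feats: (B, D), one summary embedding per long video; K = clips_per_video."""
    # Clip level: align each clip with its step-by-step narration ("what").
    clip_loss = info_nce(clip_feats, narration_feats)
    # Video level: align an aggregated video feature with the summary ("why").
    # Mean pooling is a placeholder; a learned aggregator could be used instead.
    B = summary_feats.size(0)
    video_feats = clip_feats.view(B, clips_per_video, -1).mean(dim=1)
    video_loss = info_nce(video_feats, summary_feats)
    return clip_loss + w_video * video_loss
```

The key design point the abstract describes is that both terms share the same clip encoder, so the short-term representation is shaped by the long-term summary signal as well.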
Video Action Transformer Network
We introduce the Action Transformer model for recognizing and localizing
human actions in video clips. We repurpose a Transformer-style architecture to
aggregate features from the spatiotemporal context around the person whose
actions we are trying to classify. We show that by using high-resolution,
person-specific, class-agnostic queries, the model spontaneously learns to
track individual people and to pick up on semantic context from the actions of
others. Additionally, its attention mechanism learns to emphasize hands and
faces, which are often crucial for discriminating an action, all without explicit
supervision other than boxes and class labels. We train and test our Action
Transformer network on the Atomic Visual Actions (AVA) dataset, outperforming
the state of the art by a significant margin using only raw RGB frames as
input.
Comment: CVPR 2019
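A rough PyTorch sketch of the central mechanism as the abstract describes it: a per-person query attends over flattened spatio-temporal context features with standard multi-head attention, and the attended feature is classified. The module name, feature sizes, single attention layer, and usage shapes are illustrative assumptions, not the released Action Transformer model.

```python
# Illustrative person-query attention over spatio-temporal context features.
import torch
import torch.nn as nn

class PersonContextAttention(nn.Module):
    def __init__(self, dim=256, num_heads=8, num_classes=80):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, person_query, context):
        """person_query: (B, 1, D) RoI feature for the person of interest;
        context: (B, T*H*W, D) flattened spatio-temporal feature map."""
        attended, attn_weights = self.attn(person_query, context, context)
        logits = self.classifier(attended.squeeze(1))
        return logits, attn_weights  # weights indicate where the query looked

# Example usage with made-up sizes:
model = PersonContextAttention()
q = torch.randn(2, 1, 256)               # one person query per clip
ctx = torch.randn(2, 4 * 16 * 16, 256)   # T=4, H=W=16 feature grid
logits, attn = model(q, ctx)
```

Inspecting the returned attention weights is how one would visualize the behavior the abstract reports, i.e. the query concentrating on hands, faces, and other people without any supervision beyond boxes and class labels.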